Because the dataset is distributed as a SQL dump, we load it through SQL, which is the most effective and efficient approach.
To use the SQL file, we first set up a database on localhost or AWS. After that, we used MySQL Workbench to load the SQL file into the database we established.
import pymysql
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Connecting to the database
# Option 1: connect to the database on AWS RDS
mydb = pymysql.connect(user='Lan', password='12345678',
                       host='myyelpdb.caenuubtjpww.us-east-1.rds.amazonaws.com', port=3309, database='yelp_db')
# Option 2: connect to a local MySQL server (this replaces the connection above)
mydb = pymysql.connect(user='root', password='G42955089',
                       host='localhost', database='yelp_db')
cursor = mydb.cursor()
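Nearly every cell below follows the same pattern: build a SQL string, then pull the result into a DataFrame. A small helper can reduce that boilerplate. This is only a sketch with our own naming (`query_df` is not part of the project); it is demonstrated against an in-memory SQLite database so it runs without the MySQL server, but the `mydb` connection above can be passed in exactly the same way.

```python
import sqlite3

import pandas as pd

def query_df(conn, sql, params=None):
    """Run a SQL query on any DB-API connection and return a DataFrame."""
    return pd.read_sql(sql, conn, params=params)

# Demo on an in-memory SQLite database standing in for the MySQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (state TEXT, stars REAL)")
conn.executemany("INSERT INTO business VALUES (?, ?)",
                 [("AZ", 4.0), ("AZ", 3.0), ("NV", 5.0)])
df = query_df(conn, "SELECT state, COUNT(*) AS cnt, AVG(stars) AS mean "
                    "FROM business GROUP BY state ORDER BY cnt DESC")
print(df)
```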
First of all, we take a look at where the businesses in our data are located.
sql='''
SELECT state, COUNT(state) AS COUNT, ROUND(AVG(stars),2) AS mean
FROM business
GROUP BY state
ORDER BY COUNT DESC;
'''
state = pd.read_sql(sql,mydb)
state.head()
For visualization, we use plotly to draw a map of the business locations.
import plotly
import plotly.graph_objs as go
for col in state.columns:
    state[col] = state[col].astype(str)
# Show each state's average star rating on hover
state['text'] = state['state'] + '<br>' \
    + 'Average of stars: ' + state['mean']
# Customize the color scale of the map
scl = [[0.0, 'rgb(242,240,247)'], [0.2, 'rgb(218,218,235)'], [0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'], [0.8, 'rgb(117,107,177)'], [1.0, 'rgb(84,39,143)']]
trc = dict(
    type = 'choropleth',
    colorscale = scl,
    locations = state['state'],
    z = state['COUNT'].astype(float),
    text = state['text'],
    locationmode = 'USA-states',
    marker = dict(
        line = dict(
            color = 'rgb(255,255,255)',
            width = 2
        )))
lyt = dict(
    title = 'Num. of Business By State',
    geo = dict(
        scope = 'usa',
        projection = dict(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)')
)
plotly.offline.init_notebook_mode(connected=True)
fig_map = go.Figure(data=[trc], layout=lyt)  # avoid shadowing the built-in map()
plotly.offline.iplot(fig_map)

The map above shows that most of the businesses in our data are in the western USA. Therefore, we chose restaurants in the West as our story background, since the western data give the strongest support.
Focusing on the performance of restaurants in the western US, we created a table named restaurants containing only the western restaurant records, for convenience in later queries.
sql='''
DROP TABLE IF EXISTS restaurants;
'''
cursor.execute(sql)
sql='''
CREATE TABLE restaurants(
id VARCHAR(50) PRIMARY KEY,
name VARCHAR(250),
city VARCHAR(50) ,
state VARCHAR(10) ,
postal_code VARCHAR(250),
latitude FLOAT,
longitude FLOAT,
stars FLOAT NOT NULL,
review_count INT ,
is_open INT
)
'''
cursor.execute(sql)
sql='''
INSERT INTO restaurants
SELECT id, name, city, state, postal_code, latitude, longitude, stars, review_count, is_open
FROM business
JOIN category
ON id = category.business_id
WHERE category = "Restaurants"
AND business.state IN ('AZ', 'NV', 'CA', 'WA');
'''
cursor.execute(sql)
mydb.commit()
sql = '''
SELECT COUNT(*) FROM restaurants;
'''
res = pd.read_sql(sql,mydb)
res
In our discussion, we expected that restaurants with higher stars would attract more customers. The data and plots below test this expectation.
data = pd.read_sql('select stars, count, review_count from restaurants join checkin on restaurants.id=checkin.business_id;', mydb)
import matplotlib.pyplot as plt
from matplotlib.pyplot import *
import seaborn as sns
pd.plotting.scatter_matrix(data.loc[:, ["stars","count","review_count"]],
                           figsize=[12,12], diagonal="kde")
plt.show()
# Calculate the correlation between each pair of variables
corr = data.corr(method="pearson")
corr
sns.heatmap(corr)
plt.show()
The first graph shows the pairwise relationships among the variables: stars and check-in count, check-in count and review count, and stars and review count. For stars and check-in count, most star ratings fall between 2.5 and 4.5, with 3.5-star restaurants having the greatest check-in counts. Almost the same holds for stars and review count: most reviews fall in the 3 to 4.5 range, and 4-star restaurants receive the most reviews. For check-in count versus review count, review counts under 4,000 are fairly evenly distributed, while gaps appear once review counts exceed 4,000.
In the correlation heatmap, the darker the color, the stronger the correlation. There are four very light squares in the graph; stars and check-in count have the lightest color, meaning the weakest correlation (0.15347). Stars and review count are a bit more strongly correlated (0.215748), and check-in count and review count have the strongest correlation (0.529276).
Although the correlation between stars and check-in count is not high, most restaurants with high check-in or review counts also have high stars, which underlines the importance of a high star rating.
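The coefficients quoted above come straight from `data.corr(method="pearson")`. As a sanity check, the Pearson coefficient is just the covariance scaled by both standard deviations, which is easy to verify by hand on toy data:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the std deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
r = pearson(x, y)
print(round(r, 4))  # close to 1 for a near-linear relationship
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```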
In this part, we use a bar chart to explore the categories of restaurants and ask which categories account for the larger proportions.
sql='''
SELECT category,COUNT(*) AS Count
FROM category
JOIN restaurants
ON category.business_id = restaurants.id
GROUP BY category
ORDER BY Count DESC
LIMIT 20;
'''
cate_count = pd.read_sql(sql,mydb)
# The first row is the 'Restaurants' category itself, so we drop it.
cate_count=cate_count[1:]
import numpy as np
plt.clf()
fig = plt.figure(figsize=[20,8])
ax2=fig.add_subplot(1,1,1)
ax2.set(title = 'Number Of Categories of Business',xlabel='Categories',ylabel='Number')
# Draw the bar chart
_ = ax2.bar(np.arange(len(cate_count)), height=np.array(cate_count['Count']),width=0.5,color='lightblue')
# Set the X label
ax2.set_xticks(np.arange(len(cate_count)))
ax2.set_xticklabels(cate_count['category'])
#Rotate the X label
for tick in ax2.get_xticklabels():
    tick.set_rotation(30)
The chart below indicates that the top 5 categories are Food, Fast Food, Mexican, American (Traditional), and Sandwiches. This matches our group's impression that restaurants in these categories are common around us.
plt.show()
#put the data into pandas dataframe
restaurant = pd.read_sql('select * from restaurants',mydb)
#prepare the counters; first means the first segment of the pie chart
first = 0
second = 0
third = 0
forth = 0
fifth = 0
#count restaurants in each review-count bucket
for j in range(len(restaurant)):
    if restaurant.review_count[j] <= 30:
        first = first + 1
    elif restaurant.review_count[j] <= 100:
        second = second + 1
    elif restaurant.review_count[j] <= 250:
        third = third + 1
    elif restaurant.review_count[j] <= 500:
        forth = forth + 1
    else:
        fifth = fifth + 1
#use plotly to do offline drawing at notebook
plotly.offline.init_notebook_mode(connected=True)
#create labels and corresponding values
labels = ['less than 30','30-100','100-250','250-500','more than 500']
values = [first,second,third,forth,fifth]
#draw the pie chart and adding layout, which in this pie chart, title is enough
trace = go.Pie(labels=labels, values=values)
layout = go.Layout(title='Number of reviews for all restaurants in West USA')
#go to the entire figure and show the plot
fig = go.Figure(data=[trace], layout=layout)
plotly.offline.iplot(fig)
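The manual counting loop above can also be written with `pd.cut`, assuming the same bucket boundaries (upper edges inclusive, matching the `<=` comparisons). A sketch on toy review counts:

```python
import pandas as pd

# Toy review counts standing in for restaurant['review_count'].
review_count = pd.Series([10, 45, 120, 300, 800, 25, 90, 600])

# Bin edges mirror the pie-chart segments above; pd.cut uses (low, high] intervals.
bins = [0, 30, 100, 250, 500, float("inf")]
labels = ['less than 30', '30-100', '100-250', '250-500', 'more than 500']
counts = pd.cut(review_count, bins=bins, labels=labels).value_counts().reindex(labels)
print(counts.tolist())  # [2, 2, 1, 1, 2]
```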

plt.clf()
fig = plt.figure(figsize=[10,8])
ax1=fig.add_subplot(1,1,1)
ax1.set(title = 'Number Of Reviews Business Received',xlabel='Count of Review',ylabel='Number')
ax1.set_xlim(0,200)
# Calculate the mean of number of reviews of business received.
ax1.axvline(restaurant['review_count'].mean(),color='r',linestyle='dashed')
_ = ax1.hist(restaurant['review_count'].values,bins=600,color='lightblue',edgecolor='black',alpha=0.3)
plt.show()
Apparently, most restaurants have fewer than 30 reviews, or between 30 and 100. Only 3.53% of the restaurants are popular enough to receive over 500 reviews from customers.
In this part, we will take a look at the number of reviews at each star level.
sql='''
SELECT review.stars,COUNT(*) AS Count
FROM review
JOIN restaurants
ON review.business_id = restaurants.id
GROUP BY review.stars
ORDER BY review.stars DESC;
'''
stars_r=pd.read_sql(sql,mydb)
stars_r
plt.clf()
fig = plt.figure(figsize=[8,8])
ax3=fig.add_subplot(1,1,1)
ax3.set(title = 'Stars Of Reviews',xlabel='Stars',ylabel='No. Of Reviews')
_ = ax3.bar(np.arange(len(stars_r)), height=np.array(stars_r['Count']),color='lightblue',width=0.3)
# Set the X label
ax3.set_xticks(np.arange(len(stars_r)))
ax3.set_xticklabels(stars_r['stars'])
The bar chart below shows that most reviews are 5-star or 4-star; reviews below 4 stars are less common.
plt.show()
sql='''
SELECT average_stars
FROM user;
'''
stars_user = pd.read_sql(sql,mydb)
plt.clf()
fig = plt.figure(figsize=[10,10])
ax5=fig.add_subplot(1,1,1)
ax5.set(title = "Histogram Of Users' average stars",xlabel='average stars',ylabel='Number')
ax5.set_xlim(1,5)
# Calculate the average stars of users.
ax5.axvline(stars_user['average_stars'].mean(),color='r',linestyle='dashed')
_ = ax5.hist(stars_user['average_stars'].values,bins=50,color='lightblue',edgecolor='black',alpha=0.3)
plt.show()
Based on the two graphs above, we can conclude that most users are willing to give high stars, as long as restaurants satisfy their requirements.
reviewuser = pd.read_sql('select id, review_count from user', mydb)
#prepare the counters; first1 means the first segment of the pie chart
first1 = 0
second2 = 0
third3 = 0
forth4 = 0
fifth5 = 0
#count users in each review-count bucket
for j in range(len(reviewuser)):
    if reviewuser.review_count[j] <= 5:
        first1 = first1 + 1
    elif reviewuser.review_count[j] <= 15:
        second2 = second2 + 1
    elif reviewuser.review_count[j] <= 50:
        third3 = third3 + 1
    elif reviewuser.review_count[j] <= 100:
        forth4 = forth4 + 1
    else:
        fifth5 = fifth5 + 1
#use plotly to do offline drawing at notebook
plotly.offline.init_notebook_mode(connected=True)
#create labels and corresponding values
labels1 = ['less than 5','5-15','15-50','50-100','more than 100']
values1 = [first1,second2,third3,forth4,fifth5]
#draw the pie chart and adding layout, which in this pie chart, title is enough
trace1 = go.Pie(labels=labels1, values=values1)
layout1 = go.Layout(title='Number of reviews for all yelp users in USA')
#go to the entire figure and show the plot
fig1 = go.Figure(data=[trace1], layout=layout1)
plotly.offline.iplot(fig1)

plt.clf()
fig = plt.figure(figsize=[8,8])
ax4=fig.add_subplot(1,1,1)
ax4.set(title = 'Number Of Reviews Users Sent',xlabel='Count of Review',ylabel='Number')
ax4.set_xlim(0,150)
# Calculate the average number of reviews users sent
ax4.axvline(reviewuser['review_count'].mean(),color='r',linestyle='dashed')
_ = ax4.hist(reviewuser['review_count'].values,bins=1000,color='lightblue',edgecolor='black',alpha=0.3)
plt.show()
fans = pd.read_sql('select id, fans from user', mydb)
#prepare the counters; first11 means the first segment of the pie chart
first11 = 0
second22 = 0
third33 = 0
forth44 = 0
#count users in each fans bucket
for j in range(len(fans)):
    if fans.fans[j] <= 0:
        first11 = first11 + 1
    elif fans.fans[j] <= 5:
        second22 = second22 + 1
    elif fans.fans[j] <= 100:
        third33 = third33 + 1
    else:
        forth44 = forth44 + 1
#use plotly to do offline drawing at notebook
plotly.offline.init_notebook_mode(connected=True)
#create labels and corresponding values
labels2 = ['0','0-5','5-100','more than 100']
values2 = [first11,second22,third33,forth44]
#draw the pie chart and adding layout, which in this pie chart, title is enough
trace2 = go.Pie(labels=labels2, values=values2)
layout2 = go.Layout(title='Number of fans for all yelp users in USA')
#go to the entire figure and show the plot
fig2 = go.Figure(data=[trace2], layout=layout2)
plotly.offline.iplot(fig2)

First, the wordcloud package needs to be installed.
!pip install wordcloud
If this command doesn't work, install the package manually for your environment.
sql='''
SELECT text
FROM review
JOIN restaurants
ON review.business_id = restaurants.id AND review.stars=1;
'''
onestar=pd.read_sql(sql,mydb)
onestar.head(5)
from wordcloud import WordCloud,STOPWORDS
#generate the total stopwords based on the given stopwords set
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("one")
stopwords.add("go")
stopwords.add("thru")
stopwords.add("also")
wordcloud = WordCloud(background_color="white",width=2000,height=2000,stopwords=stopwords,margin=2).generate(' '.join(onestar['text']))
#display the wordcloud
plt.clf()
plt.figure(figsize=[10,10])
plt.title("Word Cloud of Reviews for West USA Restaurant with 1 Star")
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
sql='''
SELECT text
FROM review
JOIN restaurants
ON review.business_id = restaurants.id AND review.stars=2;
'''
twostar=pd.read_sql(sql,mydb)
len(twostar)
wordcloud = WordCloud(background_color="white",width=2000,height=2000,stopwords=stopwords,margin=2).generate(' '.join(twostar['text']))
#display the wordcloud
plt.clf()
plt.figure(figsize=[10,10])
plt.title("Word Cloud of Reviews for West USA Restaurant with 2 Stars")
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
sql='''
SELECT text
FROM review
JOIN restaurants
ON review.business_id = restaurants.id AND review.stars=3;
'''
threestar=pd.read_sql(sql,mydb)
wordcloud = WordCloud(background_color="white",width=2000,height=2000,stopwords=stopwords,margin=2).generate(' '.join(threestar['text']))
#display the wordcloud
plt.clf()
plt.figure(figsize=[10,10])
plt.title("Word Cloud of Reviews for West USA Restaurant with 3 Stars")
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
sql='''
SELECT text
FROM review
JOIN restaurants
ON review.business_id = restaurants.id AND review.stars=4;
'''
fourstar=pd.read_sql(sql,mydb)
wordcloud = WordCloud(background_color="white",width=2000,height=2000,stopwords=stopwords,margin=2).generate(' '.join(fourstar['text']))
#display the wordcloud
plt.clf()
plt.figure(figsize=[10,10])
plt.title("Word Cloud of Reviews for West USA Restaurant with 4 Stars")
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
sql='''
SELECT text
FROM review
JOIN restaurants
ON review.business_id = restaurants.id AND review.stars=5;
'''
fivestar=pd.read_sql(sql,mydb)
wordcloud = WordCloud(background_color="white",width=2000,height=2000,stopwords=stopwords,margin=2).generate(' '.join(fivestar['text']))
#display the wordcloud
plt.clf()
plt.figure(figsize=[10,10])
plt.title("Word Cloud of Reviews for West USA Restaurant with 5 Stars")
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
The word clouds above indicate that customer service plays a significant role in reviews of 2 stars or fewer. To earn reviews of 3 stars or more, it is essential to improve the food.
#Select the tips belonging to West USA restaurants with at most 2 stars from the tip table.
tiplessthan2 = pd.read_sql('select tip.text from tip, restaurants where tip.business_id = restaurants.id and restaurants.stars <= 2',mydb)
#Join all the tips into one string called 'ti'.
ti = ' '.join(tiplessthan2.text)
from os import path
from PIL import Image
import numpy as np
#We set a burger meal as the mask picture.
mask = np.array(Image.open("new1.jpg"))
#set up the background and mask figure, add the stopwords filter
#in this picture we only want to see the most frequent 200 words
wc = WordCloud(background_color="white", max_words=200, mask=mask,
               stopwords=stopwords)
# generate word cloud
wc.generate(ti)
# show the wordcloud figure and add title
plt.figure(figsize=[10,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud of Tips for West USA Restaurant with Star Less Than 2")
plt.show()
#Select the tips belonging to West USA restaurants with more than 2 and at most 4 stars from the tip table.
tip2to4 = pd.read_sql('select tip.text from tip, restaurants where tip.business_id = restaurants.id and restaurants.stars > 2 and restaurants.stars <= 4',mydb)
#Join all the tips into one string called 'tii'.
tii = ' '.join(tip2to4.text)
#set up the background and mask figure, add the stopwords filter
#in this picture we only want to see the most frequent 200 words
wc = WordCloud(background_color="white", max_words=200, mask=mask,
               stopwords=stopwords)
# generate word cloud
wc.generate(tii)
# show the wordcloud figure and add title
plt.figure(figsize=[10,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud of Tips for West USA Restaurant with Star From 2 To 4")
plt.show()
By comparing these two word clouds, we can draw some conclusions.
In the 'stars 2-4' figure, words like 'great service', 'good food', 'happy hour' and 'food great' are the most prominent, and 'best', 'love', 'try', 'delicious' and 'amazing' are the next most noticeable. Looking even closer, negative tips are hard to find.
In contrast, in the 'stars less than 2' figure, although some of the words are the same, such as 'food', 'location', 'place' and 'drive', none of them are associated with a positive adjective, so it is hard to tell whether these are complaints or praise. Looking at the details, we see words like 'worst', 'bad', 'never', 'wait' and 'horrible'. We infer that some people were unhappy with the service and food, or annoyed by long waits, which can turn a restaurant into a 2-star one.
In this part, we want to know what tips are given to restaurants with fewer than 2 stars, from a suggestion point of view. We want to find what kinds of characteristics lead to customer complaints.
#add some frequent positive and neutral words to the given stopwords set
stopwords = set(STOPWORDS)
stopwords.add("food")
stopwords.add("good")
stopwords.add("place")
stopwords.add("order")
stopwords.add("great")
stopwords.add("chicken")
stopwords.add("pizza")
stopwords.add("love")
stopwords.add("best")
#set up the background and mask figure, add the stopwords filter
#in this picture we only want to see the most frequent 200 words
wc = WordCloud(background_color="white", max_words=150, mask=mask,
               stopwords=stopwords)
# generate word cloud
wc.generate(ti)
# show the wordcloud figure and add title
plt.figure(figsize=[10,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud of Tips for West USA Restaurant with Star Less Than 2")
plt.show()
Among the nouns, customers actually care most about service, location and drinks. Among the verbs, we see drive, come, taste, want and wait. Among the adjectives and adverbs, we see always, bad, slow, clean, never, worst and rude. In summary, our preliminary suggestions for restaurants are: find a more accessible location, improve customer service by being polite and patient with customers, and increase response and serving speed. In addition, make the food delicious and keep the environment tidy.
In this part, we want to see which periods, from Monday to Sunday and from morning to midnight, are the most popular, based on the data in the checkin table. A restaurant can then foresee the potential rush hours and adjust its raw materials and labor accordingly.
Following common convention, morning is 6:00-11:59, afternoon is 12:00-17:59, evening is 18:00-23:59, and night is 00:00-5:59.
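Under those period definitions, mapping a raw check-in string such as 'Monday-14:30' to a (day, period) pair is a small pure function; the helper name below is our own:

```python
def checkin_period(checkin):
    """Map a check-in string like 'Monday-14:30' to a (day, period) pair."""
    day, time = checkin.split('-')        # e.g. 'Monday', '14:30'
    hour = int(time.split(':')[0])
    if hour < 6:
        period = 'night'
    elif hour < 12:
        period = 'morning'
    elif hour < 18:
        period = 'afternoon'
    else:
        period = 'evening'
    return day, period

print(checkin_period('Monday-14:30'))   # ('Monday', 'afternoon')
```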
#select the monday data
monday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Monday%"',mydb)
#create an empty list to contain all the time data
list11 = []
for i in monday.index:
    list11.append(monday.date[i].split('-')[1])  # the date format is 'Monday-00:00', so split on '-'
#create an empty list without the ':' characters
list12 = []
for i in monday.index:
    list12.append(list11[i].replace(':',''))
#converting the time strings to integers makes comparison easy
list1 = [int(i) for i in list12]
monday['time'] = list1
#prepare the counters; for example, morning1 means Monday morning
morning1 = 0
night1 = 0
evening1 = 0
afternoon1 = 0
#start to count
for i in monday.index:
    if int(monday.time[i]) < 600:
        night1 = night1 + 1
    elif int(monday.time[i]) < 1200:
        morning1 = morning1 + 1
    elif int(monday.time[i]) < 1800:
        afternoon1 = afternoon1 + 1
    elif int(monday.time[i]) < 2400:
        evening1 = evening1 + 1
#the following cells, up to the plotting one, are nearly identical
#we run the cell above seven times, once per day, to cover all seven days of the week
tuesday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Tuesday%"',mydb)
list21 = []
for i in tuesday.index:
    list21.append(tuesday.date[i].split('-')[1])
list22 = []
for i in tuesday.index:
    list22.append(list21[i].replace(':',''))
list2 = [int(i) for i in list22]
tuesday['time'] = list2
morning2 = 0
night2 = 0
evening2 = 0
afternoon2 = 0
for i in tuesday.index:
    if int(tuesday.time[i]) < 600:
        night2 = night2 + 1
    elif int(tuesday.time[i]) < 1200:
        morning2 = morning2 + 1
    elif int(tuesday.time[i]) < 1800:
        afternoon2 = afternoon2 + 1
    elif int(tuesday.time[i]) < 2400:
        evening2 = evening2 + 1
#wednesday
wednesday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Wednesday%"',mydb)
list31 = []
for i in wednesday.index:
    list31.append(wednesday.date[i].split('-')[1])
list32 = []
for i in wednesday.index:
    list32.append(list31[i].replace(':',''))
list3 = [int(i) for i in list32]
wednesday['time'] = list3
morning3 = 0
night3 = 0
evening3 = 0
afternoon3 = 0
for i in wednesday.index:
    if int(wednesday.time[i]) < 600:
        night3 = night3 + 1
    elif int(wednesday.time[i]) < 1200:
        morning3 = morning3 + 1
    elif int(wednesday.time[i]) < 1800:
        afternoon3 = afternoon3 + 1
    elif int(wednesday.time[i]) < 2400:
        evening3 = evening3 + 1
#thursday
thursday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Thursday%"',mydb)
list41 = []
for i in thursday.index:
    list41.append(thursday.date[i].split('-')[1])
list42 = []
for i in thursday.index:
    list42.append(list41[i].replace(':',''))
list4 = [int(i) for i in list42]
thursday['time'] = list4
morning4 = 0
night4 = 0
evening4 = 0
afternoon4 = 0
for i in thursday.index:
    if int(thursday.time[i]) < 600:
        night4 = night4 + 1
    elif int(thursday.time[i]) < 1200:
        morning4 = morning4 + 1
    elif int(thursday.time[i]) < 1800:
        afternoon4 = afternoon4 + 1
    elif int(thursday.time[i]) < 2400:
        evening4 = evening4 + 1
#friday
friday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Friday%"',mydb)
list51 = []
for i in friday.index:
    list51.append(friday.date[i].split('-')[1])
list52 = []
for i in friday.index:
    list52.append(list51[i].replace(':',''))
list5 = [int(i) for i in list52]
friday['time'] = list5
morning5 = 0
night5 = 0
evening5 = 0
afternoon5 = 0
for i in friday.index:
    if int(friday.time[i]) < 600:
        night5 = night5 + 1
    elif int(friday.time[i]) < 1200:
        morning5 = morning5 + 1
    elif int(friday.time[i]) < 1800:
        afternoon5 = afternoon5 + 1
    elif int(friday.time[i]) < 2400:
        evening5 = evening5 + 1
#saturday
saturday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Saturday%"',mydb)
list61 = []
for i in saturday.index:
    list61.append(saturday.date[i].split('-')[1])
list62 = []
for i in saturday.index:
    list62.append(list61[i].replace(':',''))
list6 = [int(i) for i in list62]
saturday['time'] = list6
morning6 = 0
night6 = 0
evening6 = 0
afternoon6 = 0
for i in saturday.index:
    if int(saturday.time[i]) < 600:
        night6 = night6 + 1
    elif int(saturday.time[i]) < 1200:
        morning6 = morning6 + 1
    elif int(saturday.time[i]) < 1800:
        afternoon6 = afternoon6 + 1
    elif int(saturday.time[i]) < 2400:
        evening6 = evening6 + 1
#sunday
sunday = pd.read_sql('select checkin.date from checkin,restaurants where checkin.business_id = restaurants.id and checkin.date like "Sunday%"',mydb)
list71 = []
for i in sunday.index:
    list71.append(sunday.date[i].split('-')[1])
list72 = []
for i in sunday.index:
    list72.append(list71[i].replace(':',''))
list7 = [int(i) for i in list72]
sunday['time'] = list7
morning7 = 0
night7 = 0
evening7 = 0
afternoon7 = 0
for i in sunday.index:
    if int(sunday.time[i]) < 600:
        night7 = night7 + 1
    elif int(sunday.time[i]) < 1200:
        morning7 = morning7 + 1
    elif int(sunday.time[i]) < 1800:
        afternoon7 = afternoon7 + 1
    elif int(sunday.time[i]) < 2400:
        evening7 = evening7 + 1
#put all count result into the relevant list
nightcount = [night1,night2,night3,night4,night5,night6,night7]#this means the night count from monday to sunday
morningcount = [morning1,morning2,morning3,morning4,morning5,morning6,morning7]
eveningcount = [evening1,evening2,evening3,evening4,evening5,evening6,evening7]
afternooncount = [afternoon1,afternoon2,afternoon3,afternoon4,afternoon5,afternoon6,afternoon7]
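The seven near-identical day cells above can be collapsed into a single groupby pass. A sketch on a toy frame in the same 'Day-HH:MM' format (the real `checkin` query result would take its place):

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
periods = ['morning', 'afternoon', 'evening', 'night']

def period_of(time_str):
    """Classify an 'HH:MM' time into the four periods defined above."""
    hour = int(time_str.split(':')[0])
    if hour < 6:
        return 'night'
    if hour < 12:
        return 'morning'
    if hour < 18:
        return 'afternoon'
    return 'evening'

# Toy check-in data in the same 'Day-HH:MM' format as the checkin table.
checkins = pd.DataFrame({'date': ['Monday-08:00', 'Monday-19:30',
                                  'Saturday-20:00', 'Saturday-21:15',
                                  'Sunday-03:00']})
parts = checkins['date'].str.split('-', expand=True)
checkins['day'] = parts[0]
checkins['period'] = parts[1].map(period_of)

# One day-by-period count table instead of seven copied cells
table = (checkins.groupby(['day', 'period']).size()
         .unstack(fill_value=0)
         .reindex(index=days, columns=periods, fill_value=0))
print(table)
```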
import plotly
plotly.offline.init_notebook_mode(connected=True)
heat = go.Heatmap(z=[morningcount, afternooncount, eveningcount, nightcount],
                  x=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                  y=['Morning', 'Afternoon', 'Evening', 'Night'])
data=[heat]
layout = go.Layout(title='Heatmap of Check-in Customers Weekly and Daily')
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)

In the color bar, the color heats up as the number of check-ins increases and cools down as it decreases. Clearly the morning and afternoon (6:00-17:59 in total) are less popular than the evening and night (18:00 to 5:59 of the next day). In particular, morning is the period when restaurants tend to be emptiest on every day of the week, while Saturday evening is the liveliest time slot.
Next, we will load the top 50 and least 50 restaurants by review count, stars, and check-in count in both cities from our database.
Top 50 review_count
sql='''
SELECT latitude,longitude,review_count
FROM restaurants
WHERE state = 'AZ' and city = 'Phoenix'
ORDER BY review_count DESC
LIMIT 50;
'''
Top_review_AZ = pd.read_sql(sql,mydb)
Top_review_AZ.head()
Least 50 review_count
sql='''
SELECT latitude,longitude,review_count
FROM restaurants
WHERE state = 'AZ' and city = 'Phoenix'
ORDER BY review_count
LIMIT 50;
'''
Lea_review_AZ = pd.read_sql(sql,mydb)
Lea_review_AZ.head()
import matplotlib.pyplot as plt
Top 50 stars
sql='''
SELECT latitude,longitude,stars
FROM restaurants
WHERE state = 'AZ' and city = 'Phoenix'
ORDER BY stars DESC
LIMIT 50;
'''
Top_stars_AZ = pd.read_sql(sql,mydb)
Top_stars_AZ.head()
Least 50 stars
sql='''
SELECT latitude,longitude,stars
FROM restaurants
WHERE state = 'AZ' and city = 'Phoenix'
ORDER BY stars
LIMIT 50;
'''
Lea_stars_AZ = pd.read_sql(sql,mydb)
Lea_stars_AZ.head()
Top 50 checkin
sql='''
SELECT business_id, restaurants.latitude,restaurants.longitude,ROUND(AVG(checkin.count),2) AS checkin
FROM checkin
JOIN restaurants
ON restaurants.id = checkin.business_id AND
state = 'AZ' and city = 'Phoenix'
GROUP BY business_id,restaurants.latitude,restaurants.longitude
ORDER BY ROUND(AVG(checkin.count),2) DESC
LIMIT 50;
'''
Top_checkin_AZ = pd.read_sql(sql,mydb)
Top_checkin_AZ.head()
Least 50 checkin
sql='''
SELECT business_id, restaurants.latitude,restaurants.longitude,ROUND(AVG(checkin.count),2) AS checkin
FROM checkin
JOIN restaurants
ON restaurants.id = checkin.business_id AND
state = 'AZ' and city = 'Phoenix'
GROUP BY business_id,restaurants.latitude,restaurants.longitude
ORDER BY ROUND(AVG(checkin.count),2)
LIMIT 50;
'''
Lea_checkin_AZ = pd.read_sql(sql,mydb)
Lea_checkin_AZ.head()
Top 50 review_count
sql='''
SELECT latitude,longitude,review_count
FROM restaurants
WHERE state = 'NV' AND city = 'Las Vegas'
ORDER BY review_count DESC
LIMIT 50;
'''
Top_review_NV = pd.read_sql(sql,mydb)
Top_review_NV.head()
Least 50 review_count
sql='''
SELECT latitude,longitude,review_count
FROM restaurants
WHERE state = 'NV' AND city = 'Las Vegas'
ORDER BY review_count
LIMIT 50;
'''
Lea_review_NV = pd.read_sql(sql,mydb)
Lea_review_NV.head()
Top 50 stars
sql='''
SELECT latitude,longitude,stars
FROM restaurants
WHERE state = 'NV' AND city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 50;
'''
Top_stars_NV = pd.read_sql(sql,mydb)
Top_stars_NV.head()
Least 50 stars
sql='''
SELECT latitude,longitude,stars
FROM restaurants
WHERE state = 'NV' AND city = 'Las Vegas'
ORDER BY stars
LIMIT 50;
'''
Lea_stars_NV = pd.read_sql(sql,mydb)
Lea_stars_NV.head()
Top 50 checkin
sql='''
SELECT business_id, restaurants.latitude,restaurants.longitude,ROUND(AVG(checkin.count),2) AS checkin
FROM checkin
JOIN restaurants
ON restaurants.id = checkin.business_id AND
state = 'NV' AND city = 'Las Vegas'
GROUP BY business_id,restaurants.latitude,restaurants.longitude
ORDER BY ROUND(AVG(checkin.count),2) DESC
LIMIT 50;
'''
Top_checkin_NV = pd.read_sql(sql,mydb)
Top_checkin_NV.head()
Least 50 checkin
sql='''
SELECT business_id, restaurants.latitude,restaurants.longitude,ROUND(AVG(checkin.count),2) AS checkin
FROM checkin
JOIN restaurants
ON restaurants.id = checkin.business_id AND
state = 'NV' AND city = 'Las Vegas'
GROUP BY business_id,restaurants.latitude,restaurants.longitude
ORDER BY ROUND(AVG(checkin.count),2)
LIMIT 50;
'''
Lea_checkin_NV = pd.read_sql(sql,mydb)
Lea_checkin_NV.head()
Next, we will use matplotlib to plot six graphs of the data loaded above.
plt.clf()
fig = plt.figure(figsize=[10,20])
bx1=fig.add_subplot(3,2,1)
bx1.set(title = 'Review Count By Location in Phoenix AZ',xlabel='Latitude',ylabel='Longitude')
bx1.scatter(Top_review_AZ['latitude'],Top_review_AZ['longitude'],color = 'red')
bx1.scatter(Lea_review_AZ['latitude'],Lea_review_AZ['longitude'],color = 'lightblue')
bx1.legend(['Top 50','Least 50'])
bx2=fig.add_subplot(3,2,3)
bx2.set(title = 'Stars By Location in Phoenix AZ',xlabel='Latitude',ylabel='Longitude')
bx2.scatter(Top_stars_AZ['latitude'],Top_stars_AZ['longitude'],color = 'red')
bx2.scatter(Lea_stars_AZ['latitude'],Lea_stars_AZ['longitude'],color = 'lightblue')
bx2.legend(['Top 50','Least 50'])
bx3=fig.add_subplot(3,2,5)
bx3.set(title = 'Checkin By Location in Phoenix AZ',xlabel='Latitude',ylabel='Longitude')
bx3.scatter(Top_checkin_AZ['latitude'],Top_checkin_AZ['longitude'],color = 'red')
bx3.scatter(Lea_checkin_AZ['latitude'],Lea_checkin_AZ['longitude'],color = 'lightblue')
bx3.legend(['Top 50','Least 50'])
bx4=fig.add_subplot(3,2,2)
bx4.set(title = 'Review Count By Location in Las Vegas NV',xlabel='Latitude',ylabel='Longitude')
bx4.scatter(Top_review_NV['latitude'],Top_review_NV['longitude'],color = 'red')
bx4.scatter(Lea_review_NV['latitude'],Lea_review_NV['longitude'],color = 'lightblue')
bx4.legend(['Top 50','Least 50'])
bx5=fig.add_subplot(3,2,4)
bx5.set(title = 'Stars By Location in Las Vegas NV',xlabel='Latitude',ylabel='Longitude')
bx5.scatter(Top_stars_NV['latitude'],Top_stars_NV['longitude'],color = 'red')
bx5.scatter(Lea_stars_NV['latitude'],Lea_stars_NV['longitude'],color = 'lightblue')
bx5.legend(['Top 50','Least 50'])
bx6=fig.add_subplot(3,2,6)
bx6.set(title = 'Checkin By Location in Las Vegas NV',xlabel='Latitude',ylabel='Longitude')
bx6.scatter(Top_checkin_NV['latitude'],Top_checkin_NV['longitude'],color = 'red')
bx6.scatter(Lea_checkin_NV['latitude'],Lea_checkin_NV['longitude'],color = 'lightblue')
bx6.legend(['Top 50','Least 50'])
plt.show()
The graphs above show that the top 50 restaurants by review count and check-in are usually concentrated in a few specific locations. Even so, some of the least 50 restaurants sit in the same locations, which may be due to other factors such as customer service or food.
Phoenix, AZ
sql='''
SELECT latitude,longitude,review_count
FROM restaurants
WHERE state = 'AZ' and city = 'Phoenix'
ORDER BY review_count DESC;
'''
reviewcount_AZ = pd.read_sql(sql,mydb)
reviewcount_AZ.head()
reviewcount_AZ_y = reviewcount_AZ['review_count']
reviewcount_AZ_x = reviewcount_AZ[['latitude','longitude']]
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats
# Model and check its results.
X2 = sm.add_constant(reviewcount_AZ_x)
est = sm.OLS(reviewcount_AZ_y, X2)
est2 = est.fit()
print(est2.summary())
sql='''
SELECT latitude,longitude,stars
FROM restaurants
WHERE state = 'AZ' and city = 'Phoenix'
ORDER BY stars DESC;
'''
stars_AZ = pd.read_sql(sql,mydb)
stars_AZ_y = stars_AZ['stars']
stars_AZ_x = stars_AZ[['latitude','longitude']]
X2 = sm.add_constant(stars_AZ_x)
est = sm.OLS(stars_AZ_y, X2)
est2 = est.fit()
print(est2.summary())
sql='''
SELECT business_id, restaurants.latitude,restaurants.longitude,ROUND(AVG(checkin.count),2) AS checkin
FROM checkin
JOIN restaurants
ON restaurants.id = checkin.business_id AND
state = 'AZ' AND city = 'Phoenix'
GROUP BY business_id,restaurants.latitude,restaurants.longitude
ORDER BY ROUND(AVG(checkin.count),2) DESC;
'''
checkin_AZ = pd.read_sql(sql,mydb)
checkin_AZ_y = checkin_AZ['checkin']
checkin_AZ_x = checkin_AZ[['latitude','longitude']]
X2 = sm.add_constant(checkin_AZ_x)
est = sm.OLS(checkin_AZ_y, X2)
est2 = est.fit()
print(est2.summary())
Las Vegas, NV
sql='''
SELECT latitude,longitude,review_count
FROM restaurants
WHERE state = 'NV' AND city = 'Las Vegas'
ORDER BY review_count DESC;
'''
reviewcount_NV = pd.read_sql(sql,mydb)
reviewcount_NV.head()
reviewcount_NV_y = reviewcount_NV['review_count']
reviewcount_NV_x = reviewcount_NV[['latitude','longitude']]
X2 = sm.add_constant(reviewcount_NV_x)
est = sm.OLS(reviewcount_NV_y, X2)
est2 = est.fit()
print(est2.summary())
sql='''
SELECT latitude,longitude,stars
FROM restaurants
WHERE state = 'NV' AND city = 'Las Vegas'
ORDER BY stars DESC;
'''
stars_NV = pd.read_sql(sql,mydb)
stars_NV_y = stars_NV['stars']
stars_NV_x = stars_NV[['latitude','longitude']]
X2 = sm.add_constant(stars_NV_x)
est = sm.OLS(stars_NV_y, X2)
est2 = est.fit()
print(est2.summary())
sql='''
SELECT business_id, restaurants.latitude,restaurants.longitude,ROUND(AVG(checkin.count),2) AS checkin
FROM checkin
JOIN restaurants
ON restaurants.id = checkin.business_id AND
state = 'NV' AND city = 'Las Vegas'
GROUP BY business_id,restaurants.latitude,restaurants.longitude
ORDER BY ROUND(AVG(checkin.count),2) DESC;
'''
checkin_NV = pd.read_sql(sql,mydb)
checkin_NV_y = checkin_NV['checkin']
checkin_NV_x = checkin_NV[['latitude','longitude']]
X2 = sm.add_constant(checkin_NV_x)
est = sm.OLS(checkin_NV_y, X2)
est2 = est.fit()
print(est2.summary())
The results of the linear models above show that in Phoenix, review count and check-in count are more strongly related to location. In both cities, stars are barely related to location.
Location is usually related to population density. For example, shopping malls tend to be located downtown, where they attract more visitors. Therefore, we will use web scraping to load population density data.
Phoenix, AZ
import urllib.request
from bs4 import BeautifulSoup as bs
url = "http://zipatlas.com/us/az/phoenix/zip-code-comparison/population-density.htm"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()
# Create the soup
soup = bs(data, "html.parser")
# ---- Now we start parsing the table
bigtables = soup.findAll('div')
myLists = []
for bigtable in bigtables:
tables = bigtable.findAll('table')
for table in tables:
rows = table.findAll('tr')
r = []
for row in rows:
cells = row.findAll('td')
c = []
for cell in cells:
c.append(cell.getText())
r.append(c)
myLists.append(r)
print (myLists)
myLists[4]
num=[]
postalcode=[]
lat=[]
lon=[]
city = []
population=[]
density=[]
nationalrank=[]
for i in np.arange(1,len(myLists[4])):
for j in np.arange(len(myLists[4][i])):
if j == 0:
num.append(myLists[4][i][j])
elif j == 1:
postalcode.append(int(myLists[4][i][j]))
elif j == 2:
lat.append(float(myLists[4][i][j].split(",")[0]))
lon.append(float(myLists[4][i][j].split(",")[1]))
elif j == 3:
city.append(myLists[4][i][j])
elif j == 4:
population.append(int(myLists[4][i][j].replace(",", "")))
elif j == 5:
density.append(float(myLists[4][i][j].replace(",", "")))
elif j == 6:
nationalrank.append(myLists[4][i][j])
density_AZ = pd.DataFrame(postalcode)
# Stack them as a new dataframe
density_AZ['latitude']=lat
density_AZ['longitude']=lon
density_AZ['city']='Phoenix'
density_AZ['state']='AZ'
density_AZ['population']=population
density_AZ['density']=density
len(density_AZ)
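Parsing numbers that contain thousands separators (like the population and density fields scraped above) can be done by simply stripping the commas before converting; this small helper is an optional simplification, not part of the original notebook.

```python
def parse_number(s):
    """Convert strings like '12,345' or '3,456.7' to a number by stripping commas."""
    return float(s.replace(",", ""))

print(parse_number("12,345"))    # 12345.0
print(parse_number("3,456.7"))   # 3456.7
```

Unlike splitting on the comma and recombining the pieces, this also handles values with more than one separator, such as "1,234,567".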
Las Vegas, NV
url = "http://zipatlas.com/us/nv/las-vegas/zip-code-comparison/population-density.htm"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
data = response.read()
response.close()
# Create the soup
soup = bs(data, "html.parser")
# ---- Now we start parsing the table
bigtables = soup.findAll('div')
myLists = []
for bigtable in bigtables:
tables = bigtable.findAll('table')
for table in tables:
rows = table.findAll('tr')
r = []
for row in rows:
cells = row.findAll('td')
c = []
for cell in cells:
c.append(cell.getText())
r.append(c)
myLists.append(r)
print (myLists)
len(myLists)
num=[]
postalcode=[]
lat=[]
lon=[]
city = []
population=[]
density=[]
nationalrank=[]
for i in np.arange(1,len(myLists[4])):
for j in np.arange(len(myLists[4][i])):
if j == 0:
num.append(myLists[4][i][j])
elif j == 1:
postalcode.append(int(myLists[4][i][j]))
elif j == 2:
lat.append(float(myLists[4][i][j].split(",")[0]))
lon.append(float(myLists[4][i][j].split(",")[1]))
elif j == 3:
city.append(myLists[4][i][j])
elif j == 4:
population.append(int(myLists[4][i][j].replace(",", "")))
elif j == 5:
density.append(float(myLists[4][i][j].replace(",", "")))
elif j == 6:
nationalrank.append(myLists[4][i][j])
density_NV = pd.DataFrame(postalcode)
# Stack them as a new dataframe
density_NV['latitude']=lat
density_NV['longitude']=lon
density_NV['city']='Las Vegas'
density_NV['state']='NV'
density_NV['population']=population
density_NV['density']=density
len(density_NV)
Next, we will plot the population density information together with the Yelp data.
from PIL import Image
import urllib
import io
import geocoder
# Source: http://www.jianshu.com/p/1abcc4595829
# The original code was written for Python 2.x; I adapted it to Python 3.x.
# Fetch a static map image centered at the given latitude and longitude.
def Gmap(centerLat,centerLon,zoomS,pixelS,size,dark,saveAddress):
url = 'http://maps.googleapis.com/maps/api/staticmap?sensor=false'\
+'&size='+str(size)+'x'+str(size)+'&center='+str(centerLat)+','\
+str(centerLon)+'&zoom='+str(zoomS)+'&scale='+str(pixelS)\
+'&maptype=terrain'
if dark==True:
url = url+'&style=feature:all|element:all|saturation:-10|lightness:20'
buffer = io.BytesIO(urllib.request.urlopen(url).read())
image = Image.open(buffer)
if saveAddress:
image.save(saveAddress)
else:
image.show()
#Convert latitude and longitude to a pixel position on the map image (Web Mercator projection).
def latLonToPixelXY(lat,lon,zoomS):
mapW = 256*2**zoomS+0.0
mapH = 256*2**zoomS+0.0
x = (lon+180)*(mapW/360)# get x value
latRad = lat*np.pi/180# convert from degrees to radians
mercN = np.log(np.tan((np.pi/4)+(latRad/2)))# get y value
y = (mapH/2)-(mapW*mercN/(2*np.pi))
return x,y
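As a quick sanity check on the Web Mercator math in latLonToPixelXY, the point (0, 0) should land exactly in the middle of the world map at any zoom level. Restating the function with that check:

```python
import numpy as np

def lat_lon_to_pixel(lat, lon, zoom):
    # The world map is 256 * 2**zoom pixels on each side in Web Mercator tiles.
    map_w = 256 * 2 ** zoom
    map_h = 256 * 2 ** zoom
    x = (lon + 180) * (map_w / 360)
    lat_rad = lat * np.pi / 180
    merc_n = np.log(np.tan(np.pi / 4 + lat_rad / 2))
    y = map_h / 2 - map_w * merc_n / (2 * np.pi)
    return x, y

# At zoom 11 the map is 524288 px wide, so (0, 0) maps to its exact center.
x, y = lat_lon_to_pixel(0.0, 0.0, 11)
print(x, y)  # 262144.0 262144.0
```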
#Use geocoder to get the latitude and longitude of Phoenix AZ. Sometimes it may not work; please run it again.
g = geocoder.google('Phoenix AZ USA')
centerLat,centerLon = (g.lat, g.lng); scale = 11; pixelS = 2; size = 640
import matplotlib.patches as mpatches
red_patch = mpatches.Patch(facecolor='Red', label='Top 50')
blue_patch = mpatches.Patch(facecolor='RoyalBlue', label='Least 50')
yellow_patch = mpatches.Patch(facecolor='Yellow', label='PopulationDensity')
The code below needs a path to store the image, so please define your own path.
path = 'd:'
#If an error happens here, the previous cell probably did not run successfully.
Gmap(centerLat,centerLon,scale,pixelS,size,True, path+'phoenix.png')
Phoenix AZ
Review_count
#If an error happens here, the geocoding cell above probably failed.
centX,centY = latLonToPixelXY(centerLat,centerLon,scale)
plt.clf()
fig = plt.figure(figsize=(10, 10))
im = np.flipud(plt.imread(path+'phoenix.png'))
ax = plt.subplot(111)
ax.imshow(im, origin='lower')
#Draw scatter of population density
for i in density_AZ.index:
lat = density_AZ.loc[i,'latitude']
lon = density_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,s=density_AZ.loc[i,'density']/50,facecolor='Yellow',lw=1,alpha=0.7,edgecolors='black')
#Draw scatter of top 50
for i in Top_review_AZ.index:
lat = Top_review_AZ.loc[i,'latitude']
lon = Top_review_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='Red',lw=1,alpha=0.7,edgecolors='black')
#Draw scatter of least 50
for i in Lea_review_AZ.index:
lat = Lea_review_AZ.loc[i,'latitude']
lon = Lea_review_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='RoyalBlue',lw=1,alpha=0.7,edgecolors='black')
#Display
ax.set_xlim(0,size*pixelS)
ax.set_ylim(0,size*pixelS)
plt.legend(handles=[red_patch,blue_patch,yellow_patch])
ax.set(title = 'Review Count and Population Density By Location in Phoenix AZ')
plt.axis('off')
plt.show()
Stars
centX,centY = latLonToPixelXY(centerLat,centerLon,scale)
plt.clf()
fig = plt.figure(figsize=(10, 10))
im = np.flipud(plt.imread(path+'phoenix.png'))
ax = plt.subplot(111)
ax.imshow(im, origin='lower')
for i in density_AZ.index:
lat = density_AZ.loc[i,'latitude']
lon = density_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,s=density_AZ.loc[i,'density']/50,facecolor='Yellow',lw=1,alpha=0.7,edgecolors='black')
for i in Top_stars_AZ.index:
lat = Top_stars_AZ.loc[i,'latitude']
lon = Top_stars_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='Red',lw=1,alpha=0.7,edgecolors='black')
for i in Lea_stars_AZ.index:
lat = Lea_stars_AZ.loc[i,'latitude']
lon = Lea_stars_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='RoyalBlue',lw=1,alpha=0.7,edgecolors='black')
ax.set_xlim(0,size*pixelS)
ax.set_ylim(0,size*pixelS)
plt.legend(handles=[red_patch,blue_patch,yellow_patch])
ax.set(title = 'Stars and Population Density By Location in Phoenix AZ')
plt.axis('off')
plt.show()
Check in
centX,centY = latLonToPixelXY(centerLat,centerLon,scale)
fig = plt.figure(figsize=(10, 10))
im = np.flipud(plt.imread(path+'phoenix.png'))
ax = plt.subplot(111)
ax.imshow(im, origin='lower')
for i in density_AZ.index:
lat = density_AZ.loc[i,'latitude']
lon = density_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,s=density_AZ.loc[i,'density']/50,facecolor='Yellow',lw=1,alpha=0.7,edgecolors='black')
for i in Top_checkin_AZ.index:
lat = Top_checkin_AZ.loc[i,'latitude']
lon = Top_checkin_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='Red',lw=1,alpha=0.7,edgecolors='black')
for i in Lea_checkin_AZ.index:
lat = Lea_checkin_AZ.loc[i,'latitude']
lon = Lea_checkin_AZ.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='RoyalBlue',lw=1,alpha=0.7,edgecolors='black')
ax.set_xlim(0,size*pixelS)
ax.set_ylim(0,size*pixelS)
plt.legend(handles=[red_patch,blue_patch,yellow_patch])
ax.set(title = 'Checkin and Population Density By Location in Phoenix AZ')
plt.axis('off')
plt.show()
Las Vegas NV
#Use geocoder to get the latitude and longitude of Las Vegas NV. Sometimes it may not work; please run it again.
g = geocoder.google('Las Vegas NV USA')
centerLat,centerLon = (g.lat, g.lng); scale = 11; pixelS = 2; size = 640
#If an error happens here, the previous cell probably did not run successfully.
Gmap(centerLat,centerLon,scale,pixelS,size,True, path+'lasvegas.png')
Review_count
#If an error happens here, the geocoding cell above probably failed.
centX,centY = latLonToPixelXY(centerLat,centerLon,scale)
fig = plt.figure(figsize=(10, 10))
im = np.flipud(plt.imread(path+'lasvegas.png'))
ax = plt.subplot(111)
ax.imshow(im, origin='lower')
for i in density_NV.index:
lat = density_NV.loc[i,'latitude']
lon = density_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,s=density_NV.loc[i,'density']/50,facecolor='Yellow',lw=1,alpha=0.7,edgecolors='black')
for i in Top_review_NV.index:
lat = Top_review_NV.loc[i,'latitude']
lon = Top_review_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='Red',lw=1,alpha=0.7,edgecolors='black')
for i in Lea_review_NV.index:
lat = Lea_review_NV.loc[i,'latitude']
lon = Lea_review_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='RoyalBlue',lw=1,alpha=0.7,edgecolors='black')
ax.set_xlim(0,size*pixelS)
ax.set_ylim(0,size*pixelS)
plt.legend(handles=[red_patch,blue_patch,yellow_patch])
ax.set(title = 'Review Count and Population Density By Location in Las Vegas NV')
plt.axis('off')
plt.show()
Stars
centX,centY = latLonToPixelXY(centerLat,centerLon,scale)
fig = plt.figure(figsize=(10, 10))
im = np.flipud(plt.imread(path+'lasvegas.png'))
ax = plt.subplot(111)
ax.imshow(im, origin='lower')
for i in density_NV.index:
lat = density_NV.loc[i,'latitude']
lon = density_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,s=density_NV.loc[i,'density']/50,facecolor='Yellow',lw=1,alpha=0.7,edgecolors='black')
for i in Top_stars_NV.index:
lat = Top_stars_NV.loc[i,'latitude']
lon = Top_stars_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='Red',lw=1,alpha=0.7,edgecolors='black')
for i in Lea_stars_NV.index:
lat = Lea_stars_NV.loc[i,'latitude']
lon = Lea_stars_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='RoyalBlue',lw=1,alpha=0.7,edgecolors='black')
ax.set_xlim(0,size*pixelS)
ax.set_ylim(0,size*pixelS)
plt.legend(handles=[red_patch,blue_patch,yellow_patch])
ax.set(title = 'Stars and Population Density By Location in Las Vegas NV')
plt.axis('off')
plt.show()
Check in
centX,centY = latLonToPixelXY(centerLat,centerLon,scale)
fig = plt.figure(figsize=(10, 10))
im = np.flipud(plt.imread(path+'lasvegas.png'))
ax = plt.subplot(111)
ax.imshow(im, origin='lower')
for i in density_NV.index:
lat = density_NV.loc[i,'latitude']
lon = density_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,s=density_NV.loc[i,'density']/50,facecolor='Yellow',lw=1,alpha=0.7,edgecolors='black')
for i in Top_checkin_NV.index:
lat = Top_checkin_NV.loc[i,'latitude']
lon = Top_checkin_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='Red',lw=1,alpha=0.7,edgecolors='black')
for i in Lea_checkin_NV.index:
lat = Lea_checkin_NV.loc[i,'latitude']
lon = Lea_checkin_NV.loc[i,'longitude']
x,y = latLonToPixelXY(float(lat),float(lon),scale)
x,y = size*pixelS/2 + x - centX, size*pixelS/2 - (y - centY)
ax.scatter(x,y,facecolor='RoyalBlue',lw=1,alpha=0.7,edgecolors='black')
ax.set_xlim(0,size*pixelS)
ax.set_ylim(0,size*pixelS)
plt.legend(handles=[red_patch,blue_patch,yellow_patch])
ax.set(title = 'Checkin and Population Density By Location in Las Vegas NV')
plt.axis('off')
plt.show()
from pandas.io.json import json_normalize
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
fn = 'c:/business.json'
s=[]
for line in open(fn, 'r'):
s.append(json.loads(line))
len(s)
s[0]
df1 = json_normalize(s)
df1.head(5)
df2=df1.dropna(axis=1,thresh=39159)
df2.head()
df2 = df2.drop(columns=["address","business_id","categories","city",
"hours.Friday","hours.Monday","hours.Saturday","hours.Sunday",
"hours.Thursday","hours.Tuesday","hours.Wednesday",
"is_open","latitude","longitude","name","neighborhood","postal_code","state"])
df2.head()
df3=df2.dropna(thresh=18)
df3.head()
#Find the fraction of NaN values in each attribute
print (df3.shape)
df3.isnull().sum(axis=0)/df3.shape[0]
#The results show many NaN values
from sklearn.preprocessing import LabelEncoder
#use -1 to make missing value a separate category
df3.fillna(-1, inplace=True)
le = LabelEncoder()
for col in df3.columns:
df3.loc[:, col] = le.fit_transform(df3.loc[:, col].astype(str))
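A tiny illustration of the fillna(-1) plus LabelEncoder pattern above, on a toy column rather than the real Yelp attributes: the sentinel -1 simply becomes one more string category alongside the observed values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"WiFi": ["free", None, "paid", "free"]})
df = df.fillna(-1)                                  # missing value becomes the sentinel -1
codes = LabelEncoder().fit_transform(df["WiFi"].astype(str))
print(codes)  # "-1" sorts before "free" and "paid": [1 0 2 1]
```

This keeps missing values as their own category instead of dropping rows, at the cost of the model treating "missing" as just another attribute level.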
from sklearn.model_selection import StratifiedKFold,cross_val_score
from sklearn import linear_model, ensemble
skf = StratifiedKFold(n_splits=3)
rfc = ensemble.RandomForestClassifier(n_estimators=100, criterion='entropy',
max_depth=None, min_samples_split=2,
min_samples_leaf=4,
min_weight_fraction_leaf=0.0,
max_features='auto',
max_leaf_nodes=None,
bootstrap=True,
oob_score=False,
n_jobs=1)
for train_index, test_index in skf.split(df3.iloc[:, :-1],
le.fit_transform(df3.iloc[:, -1])):
rfc.fit(df3.iloc[train_index, :-1],
df3.iloc[train_index, -1])
print (rfc.score(df3.iloc[test_index, :-1],
df3.iloc[test_index, -1]))
fig = plt.figure(figsize=(10,10))
plt.barh(range(df3.shape[1]-1) ,rfc.feature_importances_,
tick_label=df3.columns[:-1])
The result above shows that WiFi, noise level, bike parking, and alcohol are the important factors.
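To double-check a conclusion like this, the importances can be ranked explicitly rather than read off the bar chart. A minimal synthetic sketch (hypothetical column names; the label is driven only by the WiFi column, so it should rank first):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "WiFi": rng.integers(0, 3, n),         # this column fully determines the label
    "NoiseLevel": rng.integers(0, 4, n),   # pure noise
    "BikeParking": rng.integers(0, 2, n),  # pure noise
})
y = (X["WiFi"] > 0).astype(int)

rfc = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```

On the real df3, the same two lines would turn the feature_importances_ array into a sorted, labeled ranking.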